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1 Introduction 

We consider the problem of fitting a low rank tensor 

A G R 2 , I := Ii x ■ • ■ x I d , := { fj, G D := {1, ..., d}, 
to given data points 

d 

{Mi e R | i e P}, PcX, #-P>^n M , 

m=i 

by minimizing the distance between the given values (Mj)j g p and approximations (Aj)j S p: 

A = argmin — Ai) 2 (T being a certain tensor class) 

dsT jgp 

In the class of general dense tensors this is trivial, because the entries of the tensor are all 
independent. For sparse tensors this reduces to a simple knapsack problem. Our target 
tensor class is the set of low rank tensors, i.e., we assume that the implicitly given tensor 
M G R 2 allows for a low rank approximation 

||M — M\\ < e, £ G R>o, 

where the unknown approximant M G R 2 fulfils certain rank bounds that will be introduced 
later. In particular we allow £ = 0 so that the task is to reconstruct the whole tensor M — M 
in the low rank format. This particular case is considered, e.g. in DM- 

1.1 Completion versus Sampling 

A tensor fitting problem might arise as follows: the entries (M*)j G p could be measurements 
of a multiparameter model such that each index i G P represents a specific choice of d 
parameters. If the measurements are incomplete or in parts known to be incorrect, then the 
goal is to reconstruct all values of M for all parameter combinations « G I from the known 
values (Mj)i e p (prior to the assumption that M allows for an approximation in the low rank 
format). It is crucial that the points P are given and we are not free to choose them. In 
case that the points can be chosen freely one after another, the problem simplifies drastically 
and can be approached as in dii by an adaptive sampling strategy. Sometimes one can 
propose rules on how the entries from P should be chosen, as it is done in quasi Monte Carlo 
methods. This approach is persued in [S] and defines sampling rules that allow an efficient 
approximation scheme. Again, this is different and possibly a simpler task than the tensor 
completion considered here. 

1.2 Low Rank Tensor Formats 

The class of tensors in which we aim for a completion of the given tensor entries is a low 
rank format. In the case d — 2 the rank of a tensor coincides with the usual matrix rank, 
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but in dimension d > 2 there are several possibilities to define the rank of a tensor and thus 
there are several data-sparse low rank formats available. 

In the CP (k) forma10or representation 

k d k d 

-A 'y ' ® 9n,ii y ^ I | 9/x,((dn) ^ 

—* «=l ^—* 

t=l £=1 n= 1 

the tensor completion problem has been considered in |23. [lj Tl]. The minimal number of 
summands k by which the tensor A can be represented is the tensor rank of A, but minimality 
of k is often not relevant in applications. The CP (k) format is data sparse in the sense that 
storing the factors amounts to 0(dnk) units (real numbers) of storage, as opposed to the 
n d units of the full dense and unstructured tensor A. This is the reason for the attractivity 
of the format despite many theoretical and practical difficulties |9j. 

In the Tucker format 

ki k d d 

A, , = .A[Wv)> <w(v>eK. 

h=l td =1 M=1 

tensor completion has been considered in [201IT01 fT31. T7] . This format is limited to small 
dimensions d since the so-called core tensor C requires Yi^ = i un its °f storage. The ad¬ 
vantage on the other hand is that standard matrix approximation techniques can be used by 
matricizing the tensor. 

The low rank format that we consider lies in between these two, combining the benefits of 
both: the number of degrees of freedom scales linearly with the dimension d and the format 
is based on matricizations such that standard linear algebra tools are applicable. 

Here, we put no special assumptions on the data points P , except that they are reasonably 
distributed: 


Definition 1 (Slices and slice density) We define the slice density of a point set {Mi e 
M | i G P}, Pci, in direction p G D and index j^E 2), by 


c Un) ^ P I V — 91 1 } 

The corresponding slice of a tensor A E is defined by 
A; := A E x-xid 


A ■ •— A■ 


Depending on the rank parameters of A (which in turn depend on the target accuracy of 
the approximation) the slice densities of the set P have to be high enough, i.e. 


c(Jn) > C SD rl, 


3n P-iii 9 E D , 


for a constant Csd, the oversampling factor or overall slice density relative to the rank. Note 
that thereby, the minimal value for ffP increases if any ffT^ does. If one of the values c(j M ) 

1 CP stands for canonical polyadic, in the literature also called CANDECOMP and PARAFAC 
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were zero, then this simply means that the slice A, ( =J/j is undetermined and not observable 
for any of the low rank formats mentioned above and in the following. In a minimum norm 
sense the completed tensor could be set to zero for this slice without any effect on the rank 
or approximation in the known points P. 

The low rank format under consideration is the hierarchical [6j @J or TT [[T5l CE3] or MPS 
I'JTi. |21] format. 


*1 = **l f 
* 1=2 | 
i 1 = l t 



Gi(0 



(') G>(-) Gd- 1(-) 


u 

. rfl' 

G d (-) 


Figure 1: The TT representation of a tensor in TT(ri,... ,r d _ 1 ) with G^ii^f) G 


Definition 2 (TT tensor format) Let r 0 , ..., r d G N and r 0 = r d = 1. A tensor AeR 1 
of the form or representation 

Ai u ...,i d = G^h) • • • G d (i d ), G^(v) e (1) 

for all i el and G M —>• M r /*- lXr M ^ to 6e of MPS (matrix product states) format or 

TT (tensor train) format or hierarchical format, cf. Figure [7} We define the set of tensors 
in TT format by 

TT(ri,... ,r d - 1 ) := {A G M 2 | A is of the form ([!])}. 

The parameters are called representation ranks and combined to the rank vector r: = 
(r 1; ..., r d _i). For the matrix blocks (G' /( )^ =1 we use the short notation G. G is called a 
representation system of A, and if we want to indicate that A is represented by G we write 
A G . 


The minimal ranks for the representation of a tensor A in TT format are the ranks of 
certain matricizations of A m ns]- 

The number of parameters in the TT representation is 

d 

r^-ir^n^ ~ 0(dr 2 n ), r := max n := max n M . 

ij,=i 


It could thus in principle be possible to reconstruct the tensor from a num ber of samples 
that is in (P(logiV) for a tensor having N = nf=i n * entries, cf. Section 


4.3 
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1.3 Statement of the Main Approximation Problem 


The full approximation problem can be stated as follows. For S C X let 

if i e S 


\X\\p : = 


(A'| s ).: = 


iex 


otherwise 


\\x\\s ■- II^IsIIf- 


Problem 3 (Main problem) Given a tensor M 6 known only at points P C X, and 
given representation ranks r±,... ,rd~i, find a representation 0 with representation system 
G such that A = A G fulfils 

A= argmin ||M — A\\p. 

AeTT(n,...,r d - 1 ) 


A related approach for tensor completion is presented in [2TJ where the authors use a 
steepest descent iteration on the tensor manifold. Our approach is an alternating least squares 
minimization and an overrelaxation based on ideas from LMaFit for matrix completion 


A short comparison is given in Section 4.5 


1.4 First Order Optimality Conditions and ALS 

For a representation system (G lx )j k _ 1 such that A = A G one can write the main problem in 
the form 

G = argmin \\M — A G ||p. 

G 

The direct first order optimality conditions for the matrix blocks G M are 

G M = argmin \\M — A G ||p, G u := G v for 

i.e. each matrix block G M is optimal when all other blocks are fixed. Starting from some 
approximation G, the alternating least squares approach from [7] consists of an alternating 
best fit for each of the blocks G M in the order p = 1,d. It should be noted that the 
order can as well be chosen as p = d,..., 1 or any other permutation. However, for practical 
purposes, the most straightforward choice seems to be either one of the aforementioned 
orderings, cf. Algorithm [T} 

Algorithm 1 Alternating Least Squares (ALS) algorithm 
Require: Initial guess A G 

while stopping condition not fulfilled do 
for p — 1,..., d do 

Determine G M := argmin G \\M — A G ||p 

end for 
end while 


Remark 4 (Slice-wise optimization) The minimizer G M in each step of Algorithm^ can 
be found slice-wise, since each slice yields an independent least squares problem: 

G vUv) '■= ar 9min Gix{j J\M i)i=jii - A G =j J {p£ p lp ^ =jii} 
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1.5 Alternative Optimality Conditions and ADF 

An alternative formulation of our main problem is based on LMaFit ideas [25] and given by 
introducing an additional tensor Z G M 1 so that G can be found via solving 

minimize f(G, Z) := \\Z — A g \\f s.t. Z\p = M\p, A G G TT{r \,..., r^-i). 

The latter function / yields first order optimality conditions 

Z\x\p = A G \p\p and G M = argmin ||Z — A G ||i?, G v : = G v for v j - /n 

Solving this nonlinear system of equations simultaneously for G\,, Gd, Z is not trivial. In 
a hard or soft thresholding iteration, one would have to find a best approximation A G to 
a given tensor Z, and in the matrix case d — 2 this is expensive but possible. For tensors 
in d > 2 such a best approximation is not available. A common technique for finding a 
quasi-optimal approximation is an alternating optimization approach, cycling through the 
unknowns G M (as above in ALS). But since our final goal is not the approximation of Z but 
the minimization of /, it makes sense to directly solve the nonlinear system by an alternating 
fit. We approach this nonlinear system by a nonlinear block Gauss-Seiclel iteration where the 
blocks of unknowns are Gi,, Gd, Z: 

Require: Initial guess A G 
for i=l,... do 

For all i G X \ P set Zi Af and for all i G P set Z % := Mi 
For all n G D minimize \\Z — A G || j p with respect to G^ 

end for 

Finally, we use (partial) successive overrelaxation in order to speed up the convergence. We 
call the resulting algorithm ‘alternating directions fitting’ (ADF), cf. Algorithm [2] (where 
the overrelaxation parameter still has to be specified). 


Algorithm 2 Alternating Directions Fitting (ADF) algorithm 
Require: Initial guess A G , overrelaxation parameter a > 1 
while stopping condition not fulfilled do 

For all i G X \ P set Zi Af and for all i G P set Z % := M* 
for n — 1,..., d do 

Determine Gf := argmin G \\Z — A G ||^ and set G M := G M + a(Gf — G^) 

end for 
end while 


1.6 Organization of the Article 

In Section[2j we introduce the necessary tools for the analysis and algorithmic treatment of the 
tensor approximation problem. Section [3] presents the ALS and ADF algorithm in detail and 
analyses the computational and storage complexity of one iterative step. Several practical 
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issues like adaptive choice of the ranks, improved performance, and stopping criteria are 
developed. Finally we greatly simplify the determination of the overrelaxation parameter a. 
In the numerical examples in Section[4j we apply the algorithms to three types of examples: a) 
smooth function related tensors, b) functionals of parametric PDE solutions, and c) random 
low rank tensors with and without noise. We conclude our Endings in Section [5} 


2 Optimization in the TT-Format 


In this section we introduce the neccessary tools to work with matrix blocks in order to derive 
and formulate the core step of the ALS and ADF algorithm (Theorem 23). 


2.1 Matrix Blocks 

First we introduce matrix blocks, which are a useful tool both for tensor calculus and arith¬ 
metic in TT representation. 

Definition 5 (Matrix block) Let ki,k 2 ,n G N. We define a matrix block H G (M, klXk2 ) n 
as a vector of matrices H( 1),..., H(n) G M fclXfc2 . We call k\ x k 2 the dimension and n the 
length of H. 

Remark 6 a) In j[7^ a matrix block H is called a component function H(-). We use the 
name matrix block to point out that H has the structure of an array of matrices, b) For 
fixed ki, k 2 G N the set of matrix blocks H G (M fclXfc2 ) n forms an M.-vectorspace as well as a 
left-module over the non-abelian matrix ring M fclXfcl and a right-module over M fc2X/C2 . 

Matrix blocks can be combined via the Kronecker product to form higher dimensional 
tensors as they appear in the definition of the TT representation A G . 

Definition 7 ((Kronecker) product between matrix blocks) We define the (Kro¬ 
necker) product 0 for matrix blocks Hi, H 2 of dimensions k\ x k m , k m x k- 2 and lengths n\, n 2 
as 

(Hi®H 2 )((i,j)):=Hi(i)H2(j) 

where (Hi 0 H 2 ) is a matrix block of dimension k\ x k 2 and length nin 2 . 

The definition is consistent with the conventional Kronecker product such that associativity 
is given. 

In order to simplify the notation we use the following convention: 

• We treat the product of a matrix and a matrix block as if the matrix was a block of 
length 1 and skip the 0. It is referred to as pointwise multiplication. 

• We write (H\ 0 ... 0 H n )(ii... i n ) instead of (H\ 0 ... 0 H n )((ii... i n )). 

• The empty Kronecker product is defined to be / (identity matrix of suitable size). 
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Remark 8 (Generating A G ) Using the Kronecker product, one can express A G by 

A fh,..., id ) = (Gi ® ® G d ){i i,..., i d ), A G = ® ® G d . 

In order to apply standard matrix tools, we have to switch between matrix blocks, matrices, 
and tensors. The necessary foldings and unfoldings are introduced in the following. 


Definition 9 (Left and right unfolding, transpose) Let H G (M fel xfe2 ) n a ma t r i x 
block. We define the left unfolding C(H) as 







C(H) : = 

m 2) 


&W) 



H(n) 



LJ 


and the right unfolding 7 Z(H) as 


( 2 ) 


TZ(H) := [H( 1) ff(2) 


tf(n)] G M fclXnfc2 . 



(3) 


The transpose H 7 of a matrix block is a matrix block defined by H T (i ) := H{i) T . 

In [18] the left and right unfoldings C(H) and 7 Z(H) are denoted by H L and H R . We 
adjust the notation to our requirements and in order to illustrate that they are mappings. 


Remark 10 (Conjugacy of block operations) The left and right unfolding are conju¬ 
gate operations by means of 

C(H) t = TZ{H t ) 

Definition 11 (Left and right s-unfolding of a representation) For a representation 
G as in Definition ^ we denote the left s-unfolding by 

G <s :=£{G 1 ®...®Gs-eR n<Sxrs -\ n <s = 

ll<s 

and likewise the right s-unfolding by 

G >s := TZ(G S+1 <g>... ® G d ) G M rsXn>s , n >s = JJ n M . 

fj.>s 

We shortly call these just unfoldings and skip the index s. 


Definition 12 (Block matricization) Let A G M 1 be a d-dimensional tensor. A block 
matricization with respect to s G (1,..., d}, Ar^, is defined as the matrix block of dimension 
(■7i!... n s _i) x (n s+ i... nfi and length n s , given by 

(^4(s)(*s))(ji,...,i a -i),(i s+ i,...,i d ) - = A ii,...,i d i ^s £ ^ s . 

In case that A is a tensor in TT format with representation A = A G , the block matricization 
is simply (cf. Figure [2]) 

aG W1<s S~1>S 

-^(s) ^ GT S Lt 
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G >s of the left un- 


r 


JJ1 1LIJJ Ll_l J 



“in t i rrin rrn 

_l _l _L J_ L l_l J Ll_l J 


Figure 2: The block matricization of A G is the product A G S ^ = G <s G s 
folding times matrix block times right unfolding. 


2.2 Scalar Product and Orthogonality 

The standard scalar product can be transfered to matrix blocks as follows. 

Definition 13 ((Scalar) product of matrix blocks) Let G and H be matrix blocks of 
dimensions k\ x k m ,k m x k- 2 and same length. Then we define their (scalar) product as 

(G,H) :=^2G{i)H(i) = K(G)£(H) E R klXk2 . 


For a matrix J E R kmXkm we define 

(G, J, H) := (GJ, H) = (G, JH). 

Note that (•, •) is only a product with scalar output regarding its module properties. 

Definition 14 (M-scalar product and matrix block norm) Let V := (M fclXfc2 ) n be the 
R-vector space of matrix blocks of dimension k\ x k 2 and length n. Then (-, •) defines a scalar 
product on V via 

(G, H) r := trace(G, H T ) = trace(G T , H), G,H e V. 

The corresponding norm || • || on V is defined as ||G|| := y/(G, G) R . 

Remark 15 (Properties of the matrix block norm and scalar product) For a ma¬ 
trix block G, tensor A and index s E D, it holds 


l|G|l 


/£ l|G(i) 


IF) 




The M scalar product hence coincides with the standard scalar product between the according 
vectorizations of the matrix blocks. 

We introduce the concept of orthogonality (cf. (Tj) for matrix blocks, by which we can 
simplify the minimization problem. 
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Definition 16 (Orthogonality of matrix blocks) For a matrix block H, we call H 
left orthogonal if the columns of C(H) are orthogonal (this being (H T ,H) = I), and right 
orthogonal if the rows of 71(H) are orthogonal (this being ( H , H T ) = I). 

Let Q be a matrix block of same dimensions as H. We then define the (non-unique) operation 
orth 1 such that for Q = orth £ (H), the pair (C(Q),R) is a QR-decomposition of C(H). Then 
Q is left orthogonal and QR = H. 

Likewise orth r is such that for Q = orth r (H), the pair (. L,IZ(Q )) is an LQ-decomposition of 
71(H). Then Q is right orthogonal and LQ = H. 


In Corollary [18j we demonstrate how orthogonality, the scalar product and the Kronecker 
product are used to show the feasibility (Theorem 26) of the ADF core step (Theorem 23). 


Lemma IT (Scalar products of Kronecker products) Let Gi,G 2 and Hi,H 2 be ma¬ 
trix blocks of appropriate dimensions and lengths. Then 


{(Gi ® G 2 ) t , H x <g> H 2 ) = <<#, (Gf, H 1 ), H 2 ), 


respectively 


(Gd ® G 2 , (H x ® H 2 ) t ) = (Gi, {G 2 ,H?),H?). 


Proof: Due to symmetry we consider only the first case. By definition and reordering of 
summation, we obtain 


((Gd ® G 2 ) t , H x ® H 2 ) = ^((Gd ® G 2 )(i)) t (Hi ® H 2 ){%) 

i 

= Y, G ^) TG ^) T ^i)H 2 (i 2 )) =^G 2 (z 2 ) T ^(G 1 (^ 1 ) T i/ 1 (2 1 ))iT 2 (z 2 ) 

* 1,*2 72 il 

= Y, G ^ T ( G 1’ = (Gl (Gf, Ht), H 2 ). 

72 


Corollary 18 (Orthogonality of Kronecker products) If Gd 

in Lemma\l7\ then 

((Gi ® G 2 ) T , Hi ® H 2 ) = (G t 2 , H 2 ). 


If G 2 = H 2 are right orthogonal in Lemma 11, then 


Hi are left orthogonal 


(Gd ® Gd, (Hi ® H 2 ) t ) = (Gi, H((). 


Remark 19 (Non-uniqueness of representations) In the TT-format, the representa¬ 
tions are highly non-unique m This degree of freedom can be an advantage: one can 
always assume that all matrix blocks Gi are left orthogonal for i < h and right orthogonal 
for i > h. Then G is called orthogonalized with respect to h, or in short h-orthogonal. This 
concept is also described in m where Gh is called core ofG. It follows that ||kL G '|| F = ||Gd|| • 
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3 The ALS and ADF Algorithm 

We first approach Problem [3] by the ALS Algorithm [lj for which we introduce the rank 
increasing strategy in detail in Algorithm [3j We then derive the optimality conditions of 
this problem with respect to a single block, which is the basic step of the ADF Algorithm 
[ 2 ] We adapt the stopping criteria, previously given for the rank increasing ALS algorithm, 
and provide a useful heuristic for choosing the overrelaxation parameter a (Remark [30] and 
Algorithm [2]) . Finally, we greatly simplify the choice of a. 


3.1 Rank Increasing Strategy and Alternating Least Squares 


In this section we assume that a target rank rfi na i is given and that we are interested in a 
tensor completion scheme with equal ranks r\ — ■■■ — r d ~i = rfi na i in the TT format. A 
successful strategy for finding good initial values for the optimization is to start with minimal 
ranks ri — ... — r d -± = 1. Each time the algorithm fails to progress sufficiently (cf. Remark 


21 ), the ranks r M of G are increased until the final target rank Vf ina i is reached. 


Remark 20 (Initial values) We start our approximation scheme with equal ranks rq = 
. . . = rd- i = 1 and matrix blocks 





Vs, i. 


G is thereby uniform and each block is orthogonal. The adaption of the representation G to 
ranks r + 1 is done in a straightforward way. The two matrix blocks Gi,Gd are replaced by 


Gi(i) <!— [Gh(i) 


Gd(i) <— 


G d (i) 


while the other matrix blocks are replaced by 


G s (i) <— 


G s (i) 
0 



Vi. 


Vi, 


This results in an initial guess which is the sum of the previous (lower) rank approximation 
plus a rank one term as above. 


Remark 21 (Stopping criteria) Our rank increasing scheme needs a robust stopping cri¬ 
terion for the least squares fixed rank optimization. Here, we use the heuristic that when¬ 
ever the improvements of one sweep are too small, we stop the fixed rank optimization 
and increase the rank parameter, where ’sweep’ refers to one alternating cycle through all 
directions. Let ( 7)5 denote the arithmetic mean of the last 5 residual reduction factors 
(Res(G) := \\A G — M\\ P after a sweep): 


Res(G i ) 
Res(G *-!)’ 


i = iter — 4,..., iter, 


( 7)5 := 


Titer—4 + ••• + Titer 

5 


11 









Then we stop the fixed rank optimization if 

11 (T) 5 | &stop 

where reasonable choices for e stop vary between 10~ 2 and 10 -5 . 

The final algorithm with our choice of starting values is given in Algorithm [3j The or- 
thogonalization of G with respect to s in the inner loop is not necessary but improves the 
stability and can be performed without significant increase in computational complexity. 


Algorithm 3 Rank increasing ALS algorithm 

Initialize the representation G for r = 1 (Remark [20|; 
for r = 1... r final do 

for iter = 1,.. ., iter max do 
for s — 1 ,..., d do 

orthogonalize G with respect to s; 
update G s t— argmin Gs \\A G — M ||p; 
end for 

if stopping criteria apply then 
stop the iter loop; {Remark |2l|f 
end if 
end for 

adapt representation to r + 1; {Remark |20jf 
end for 


Lemma 22 (Computational complexity) The computational complexity for one sweep 
of the ALS algorithm for rank r is in 


C>(r 4 d#P) 

Assuming that for each rank r — 1,... ,rfi na i we require iter max many sweeps of ALS, we 
obtain a total complexity of 

(D(iter max r 5 d#P) 

Proof: The estimate for the total complexity obviously follows from the first one. For one 
sweep we have to determine each of the blocks G s once by setting up and solving a linear least 
squares problem. Naturally the least squares problem decouples into n s independent linear 
least squares problems of size ff{p \ p £ P,p s = i s } x r 2 . Solving these for all i s — 1,..., n s is 
possible in 0(ffPr 4 ), and summing this up for all directions s — 1,... ,d gives a complexity 
of 0(dffPr 4 ). In addition to the pure solve, we have to setup the least squares matrix, and 
we orthogonalize G with respect to s. 

The orthogonalization step is independent of P and of negligible complexity 0(dnr z ) 
jam- Setting up the least squares matrix requires ffP times the (partial) evaluation of 
the tensor A , which is of complexity 0(dr 2 ) per entry, leading to a negligible complexity 
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of 0(r 2 d#P). 


For the convergence of the ALS iteration, we state the result from [T8I Theorem 2.10]: 
under suitable full rank assumptions on the Hessian in the local minimizer, the ALS iteration 
converges locally at least linearly to the local minimizer. 


3.2 The ADF Core Step 

The core step we outline below describes how the update of G in Algorithm [2] in the unac¬ 
celerated case is performed. 


Theorem 23 (Core step of the ADF algorithm) Let s £ {l,...,d}. Without loss of 
generality, we assume that G is orthogonalized with respect to s (cf. Remark 19). 

Then the minimizer G s in Algorithm [£[ for all j £ T s , is given by 


G s (j) = ( G< S ) T Z {s) (j) (G> s ) t 

= Y, Z i {G 1 (i 1 )...G s . 1 (i s ^ 1 )) T (G s+1 (i s+1 )...G d (i d )) T ^ 

i£l,i s =j 


Proof: By assumption, G i,..., G s -i are left orthogonal and G s+ i,... ,Gd right orthogonal. 
Therefore G <s has orthonormal columns and G >s orthonormal rows. Then 


G s = argmin \\Z — A G || F = argmin ||Z( S ) — A£\|| = argmin ||Z( S ) — G <s G s G >s || 

Gs Gs Gs 

and, due to orthogonality, it follows that, for all j £ Z s , 

G s (j) = argmin \\Z {s) (j) - G <s G s (j ) G >s || = argmin ||(G <S ) T Z {s) (j) ( G >S ) T - G s (j)||. 

G s (j) Gs(j) 


The core step above is formulated without any overrelaxation. The overrelaxation param¬ 
eter a can however be included directly into the core step by modifying Z as follows. 

Lemma 24 Let A G = G\® ■ ■ ■ ® Gd £ be given, a £ M, Z £ and 


Gf := argmin \\Z {s) - G <s G s G >s ||. 

Gs 

Then Gf := aGf + (1 — a)G s satisfies 

Gf = argmin || Zf s) - G <s G s G >s || (4) 

G s 


for Z a := aZ + (1 — a)A G . 
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Proof: We assume, by contradiction, that there exists G s G" satisfying G s = aGf + (1 — 
a)G s and 

II ryOt y'i /~1>S || || ryOt /^1<S /^>S|| 

||Z.( S )— Or U s Ur II < ||-^( s ) — CJ Or s U ||. 

Inserting Z a , G s and G" leads to 

|| Z( s) - G <s G s G >s || = ||aZ (a) - aG <s G+ G >s + (1 - a)(Af s) - G <s G s G >s )|| 

< ||Z ( “ } - G <s G“ G >s || = ||aZ w - aG <s Gf G >s + (1 - a)(Ag } - G <s G s G >s )|| 
which is equivalent to 

a||Z (a) - G <s G+ G >s || < a||Z (s) - G <s G+ G >s ||. 

This is a contradiction to the minimality of Gf. This proves that G“ is the minimizer of the 
minimization problem Q. ■ 


Remark 25 (Denoting current and old representations within sweeps) The inter¬ 
mediate tensor Z a is not updated along with the representation, but in chosen increments, 
namely after each sweep. During each sweep, we denote with G~ the old representation used 
for the last update of Z a and with G the current representation. Therefore Z a is always based 
on the old representation. 


Theorem 26 (Practical ADF core step) Under the assumptions of Theorem 23. the up¬ 
date block G s with overrelaxation parameter a is given, for all j e T s , by 


G s (j)=(G <s y (G~) <s GJ(J) (G~) >s (G >s ) t 

^ ✓ s. y 

(LSI) USD 


(5) 


+ ot(Mi — A i ) (Gi(*i)... G s _i(z s _i)) (G s+ i(z s _|_i)... Gd(if)) (6) 

- .. - ✓ - - -' 

ieP,i s =j 


(LM})i 


(LMDi 


(The short notations are used for Lemma 21.) 


Proof: According to Theorem 23 and Lemma 24 we have 

G.U) = ( G <S ) T Zfaij) {G> S ) T . (7) 

Z = A G |j\p + M\p (cf. Algorithm [ 2 ]) and Z a = aZ + (1 — a)A G (cf. Lemma 24) yield 

Z Q = A g " + a(M\p — A g ~ |p) 


°AFirst summand 


^Second summand 


which we insert into (JT]) . 

First summand: Recall that A?, can be expanded (cf. Figure 2). From the definition of 
( G <s ) and ( G >s ) (cf. Definition [ilj), we derive that 

( 8 ) 


(G < ") t (G>-y = (G<T (G-)<* G~(j) (G-)>* (G>*) 


,>s\T 


<s\T (rr— \< s r^~{ 


'i — \>s (r<>s\T 


Second summand: As (M\p — A G |p), = 0 for all i P, we can reduce the summation 
from I to P and obtain the formula stated in the theorem. ■ 
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3.3 Computational Complexity of ADF 


The statements presented in this subsection are based on the sweep with order 1 —> d, but 
can be transfered to permutations. 


Lemma 27 (Successive computing) The occuring terms in the core step (Theorem 26) 
during the sweep (s — 1 —>• d) can be reduced to simpler successive computations. Note that 
in step s, the right matrix blocks G s+ 1 ,... ,Gd are unchanged and equal to those of the old 
representation G~. We then have that 


(LS]) = (G T s _ l ,(LS 1 s _ l ) 1 G:_ l ), (9) 

where ( LS \) = 1, while (LSI) = I (the identity matrix) due to the orthogonality conditions. 
Likewise 


{LM])i = G s -i(i s -i) T (. LM]_ Ji, (10) 

(LM% = (LM%G~ s+l (i s+1 ) T (11) 

where (LM\) = 1. Hence, while (LS 1 ) and (LAP) are updated within the sequence, ( LM 2 ) 
is calculated before. Furthermore, (LM)) and (LMf) can be used to update A G \p. 

Lemma 28 (Computational complexity) Let r := max{ri ,..., r^-i} and n : = 
max{n\,... , iid}. The complexity for one full sweep of updating Z, G\,... ,Gd In the ADF 
iteration is 

0(r 3 dn + r 2 dffP). 


Proof: We analyze the operations in Lemma |27| and Theorem [26] for a step s within a sweep 
s — 1 —)- d\ 


1 . 

2 . 

3. 

4. 

5. 

6 . 


2 n times an (r x r) times (r x r) matrix multiplication: 0(nr 3 ). 

([To]) & (0: 2 ffP times an (1 x r) times (r x r) matrix multiplication: 0(ffPr 2 ). 

([5]): 2 n times an (r x r) times (r x r) matrix multiplication: 0(nr 3 ). 

p times evaluation of A G , by using the values ( LM]),(LM 2 ): 0(ffPr 2 ). 

ffP times an (r x 1) times (1 x r) matrix multiplication: 0(ffPr 2 ). 

switching orthogonality of G: one QR decomposition of an nr x r matrix and n times 
an (r x r) times (r x r) matrix multiplication: 0(nr 3 ). 


Each of these steps is performed 0(d) times. 

This leaves us with the computational complexity of one left-hand sweep of 0(r 3 dn + 
r 2 d#P). 


Remark 29 (Complexity of ALS and ADF) The computational complexity of one 
ADF sweep is in 0(r 3 dn + r 2 df(P), whereas an ALS sweep is in 0(r 4 d)fP) (cf. Lemma 
22), i.e., asymptotically an ADF step is by a factor r 2 faster than an ALS step. In the 
numerical examples section we compare the speed and the necessary number of iterations for 
several examples. 
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3.4 Preliminary choice of the SOR Parameter a and Stopping 
Criterion 

By an optimized determination of the acceleration parameter a, one can speed up the con¬ 
vergence of the ADF algorithm considerably. Therefore, after each sweep of the ADF Al¬ 
gorithm [2j we allow a relatively expensive search for a suitable a by testing increased ( a up ) 
and reduced ( a down ) values of a until the residual decays (or we break). The corresponding 
representations are denoted by G up , G down , and the direction (up, down or back) is denoted 
by dir. The residual error is denoted as above by Res(G) := || A G — M||p. 


Remark 30 (Determination of the overrelaxation a) To handle the acceleration pa¬ 
rameter a, we introduce a second parameter 6, an increment parameter. Each sweep is run 
for two different accelerations (a up ,a down ): 

a up := a + 5, a down := max{ 1, a — 5/5}. 


This choice ensures that the overrelaxation parameter is at least a > 1. Depending on the 
residuals of the results, one of the three directions is chosen as specified in Algorithm [/} It 
determines the new a, 5 as well as G. 

In order to estimate and understand the magnitude of a, one can view the summand 0 
as a spot-check evaluation of the same term but for P = I, which would represent a full, 
maximal sampling set. Therefore, it has to be multiplied by &p. For the initial acceleration 
parameters needed for the ADF algorithm, we obtain 


a := 


#X 

#P’ 


5 := 


a 

4' 


Algorithm 4 Choice of the SOR parameter a 

Notation: \ 5 means 5 := |5, ^ 5 means 5 := min(a 6acfc /10,1.25) 


The values G up and G down for overrelaxations a up and a down are already computed 
if Res(G up ) > Res(G) and Res(G down ) > Res(G ) then 

Set a |(1 + a), \ 5 and dir := back , then recompute G up and G down 
Restart Algorithm [2] (break if this happens more than 10 times in a row); 
else if ( Res(G up ) < Res(G down )) then 
If dir = up then 5, otherwise \ 5; 
a := a up ; G := G up ; dir := up] 
else if ( Res{G down ) < Res(G up )) then 
If dir = down then 5, otherwise \ 5; 
a := a down ] G := G down ] dir := down] 
end if 


Finally, we need an adaptive reliable stopping criterion in conjunction with the rank- 
increasing strategy discussed previously. 
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Remark 31 (Stopping criteria) We denote again by (7)5 the arithmetic mean of the last 
5 residual reduction factors 


ResiG*) 

7i:= SS(G^j’ 1 ~ iter — 4.. . iter 

Our first stopping criterion is simply like for ALS 

| 1 (a ) 5 I ^ &stop 


with e between 10~ 2 and 10~ 5 . 

However, this is only tested if the direction is dir = down or the last residual reduction 
fulfils: |1 — ^j-| < 10 -7 . Note that we cannot compare the specific £ stop of ADF with the 
one of ALS, as ADF is faster in time but with smaller residual reduction per iteration. Our 
second stopping criterion is: Stop if the last 10 directions were dir = back, meaning there is 
no residual reduction even if the SOR parameter a approaches 1. 

A detailed analysis by numerical experiments on the optimality of a from the above heuris¬ 
tic is given in the supplementary material. We can summarize that even an expensive line 
search to determine the optimal a for each sweep gives almost the same results as the simple 
heuristic. This motivates the simplified determination of a in the next subsection. 

3.5 Automated Overrelaxation in Microsteps 

The idea for the automated overrelaxation is not to choose one a for the whole sweep s = 
1 —>■ d but rather a different a = a(s) for each (micro-) step s. As it will turn out this 
enables us to determine the optimal o(s) and interprete the iteration as an approximate 
ALS iteration. 

Definition 32 (Residual tensor and matrix block projection) We define the residual 
tensor S and the matrix block projection P s via 

Rm '■= ( M ~ A G )\ Pl (A( s) )|p s := (A|p) (s ), 

such that for any s G D: ( Sm)( s ) = (M( s ) — G <s G s G >s )|p s . When the context is clear, we 
skip the indices M or G. 

We recall that the tensor Z a is given by 

= A g ~ +a(M\ P - A G '\ P ) = A G ~ +aS G ~. 

If we assume that G = G~ is s-orthogonal and we determine the update only in direction s 
(instead of the whole sweep 1 —>■ d), then the update used in ADF simplifies to 
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G?(j) = (G <S ) T Z? s) (j) ( G> s ) t ■ 

= (G <S ) T ( G~) <s G~(j) (G~) >s (G >S ) T 

S. ' ✓ >s ' ✓ 

=7 =7 

+ a (Mi — Af ) (G\(ii)... G s -i(i s _i)) T (G s+ i(i s+ i)... Gd(id)) T 

i£P,i s =j 

^--v--> 

=:N(j) 

= G~(j) + aN(j). 

For the whole matrix block this is N — G <sT S G ^ G >sT . 

Lemma 33 (Optimal acceleration) The optimal overrelaxation parameter 

a* ■= a *( s ) ■= argmin ||G <S G“ G >s - M (s) || Pa 

Ot 

for the update of block s is given by 

a* = \\N\\ 2 F /\\G< s NG> s \\ 2 Ps . 

Proof: The optimal a* from the quadratic minimization is 

a* = ( G <s N G >s , (M {s) - G <s Gj G' >s )| Ps )/||G' <s N G >8 \\ 2 Pa . 

Finally, the trace properties can be used to simplify the nominator: 

(G <s N G >s , (M (s) - G <s G~ G >s ) Ps ) 

n 

= trace((G <s N(j) G >S ) T (M w (j) - G<‘ G~(j) G >’) P .) 

3 =1 
n 

= V trace(N(j) T G <>T (M w (j) - G <s G;(j) G>‘) P , G>‘ T ) 

3 = 1 

= ( N,G <sT (M (s) - G <s G~ G >s ) Pg G >sT )) 

= {N, G <sT Sf~ G >sT ) = (TV, N) 


Note that the change in the residual tensor has already been calculated for the determi¬ 
nation of a*: Sf s) = S G ~ - a (G <s N G >s )\ Ps . Furthermore \\S G ~ || 2 - ||S G || 2 = a ||iV|| 2 . We 
summarize the final ADF in Algorithm [5j 

Remark 34 (Overrelaxation in Microsteps) The overrelaxation parameter a does not 
need to be uniform for the whole block G s . We can proceed with each part G s (j),j — 1,..., n s 
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Algorithm 5 Rank increasing ADF algorithm 

Initialize the representation G for r = 1 (Remark [20|; S := (Definition [32|; 
for r = 1,..., rf ina i do 

for iter = 1,..., iter mox do 
for s — 1,..., d do 

s-orthogonalize G and calculate 

N ■= G <sT S {s) G >sT , Z n ■= ( G <s N G >s ) Pa 


, ||iV|| 2 F r T 

set a \= I,' ,2 ; Lemma 

W z m\p s 

update 


33 


Remark 


34 


resp.} 


fib := G. + aN. 


Q. 




end for 

if breaking criteria apply then 
stop the iter loop; {Remark |2l|f 
end if 
end for 

adapt representation to r + 1; {Remark |20jf 
end for 


seperately due to their independency. N and Z jv remain the same and the optimal a* * for 
slice j is 


* \\N(j)\\ 2 F 

/ . — - 

J ||G<- N(j) G>% 3 


for (A (s) (j))| P i 


(^4|p)( s )(j), j = l,...,n s . 


Hence, we update G s (j ) = G s ( j ) + ajN(j). This is what we use in practice as it typically 
gives a lower residual for the same computational complexity. 


Finally, we can interprete the ADF iteration with overrelaxation in microsteps as an ap¬ 
proximate ALS iteration: The block N as defined above is the gradient of the residual 
function in mode s. That is, for 

R s : M r s-i XrsXns _► M, G s ^^\\M - A g \\ 2 p , 

we have N = V R s . Therefore the ADF (micro-) step is an alternating best approximation 
of the blocks G s , s = 1 —> d, but only in the direction N of steepest descent (after s- 
orthogonalization). In the numerical examples we observe that indeed ADF requires a few 
more iterative steps, but since the complexity is by a factor r 2 lower, this is advantageous. 
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4 Numerical Experiments 

4.1 Data Aquisition and Measurements 

Sampling: In order to obtain a sufficient slice density, cf. Definition [lj we generate the set 
P in a quasi-random way as follows: For each direction p = 1 ,d and each index i p G T p 
we pick CsD r2 indices A,..., i p +i, ■ ■ ■, id at random (uniformly). This gives in total 
ffP = dnCsD r2 samples (excluding some exceptions), where C$d is the slice density from 
Definition [I] As a control set C, we use a set of the same cardinality as P that is generated 
in the same way. 

Stopping parameter: We give neither a limit to time nor to the number of iterations and 
use only the previously mentioned stopping criteria where the £ stop for ADF is always 1/3 the 
one for ALS. The different choices for £ stop are to compensate for the differing per-iteration 
computational complexity of each algorithm (and lead to a fair comparison). 

Order of optimization: Furthermore we use a slightly different order of optimization as 
previously discussed. Instead of the sweep we gave before (s = 1 ,...,d), we alternate 
between two sweeps (s = 1,..., h, s = d ,..., h, h — [d/ 2J) to enhance symmetry. A full 
alternating sweep (s = 1,..., d, s = d,, 1) can also be considered. However, we found 
that this sweep is slightly less effective. 

Notation: For the results of the tests we denote the ratio of known points p = ffP/n d , the 
relative residual resp = || A—X||p/||A||p, the error on the control set resc = ||A—A||c/||A||c 
and the time in seconds. In order to save space, we sometimes label the y-axis above plots. 


4.2 Approximation of a Full Rank Tensor with Decaying Singular 
Values 

As a first example, we consider a tensor A G M. 1 given by the entries 

A(h,...,id) := vj • (12) 


Remark 35 (Approximation by exponential sums) A good low-rank approximation of 
the aforementioned tensor A (12) can be obtained easily from the following observation. For 
any desired precision £ G (0,1) and R > 1 there is a k G d(log(e:) log(R)) such that 


Vr G [1, R] : 


1 

y/r 




i=l 


< £, 


(13) 


for specific values of u:*,a* that depend on the desired accuracy £ and upper bound R. The 
particular values can be obtained, cf. from the following webpage: 


http://www.mis.mpg.de/scicomp/EXP_SUM 
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To transfer this observation to the multidimensional case, we insert r 
and transform 

d 


e 


= e -E 


d 2 
1 



S=1 


x\\ 2 ,x £ {1,... ,n} d 


Since, in this case, r £ [d, dn 2 ], we rescale u = and a = 1 a* as well as require that 

R> n 2 . We finally obtain 


- 1/2 


A 


(*i>—,*d) 


<S =1 


k 

£ 

i=i 


LOi 


n 

S=1 


o-aiil 


This yields a TT format respresentation A = A G with square diagonal matrices 


= u ie - a ' m f (G s (m)) M = e~ aim , (G d (m)) a = e 


_ —aim 


for i = 1 ,..., k, s = 2,... ,d — l, m = 1,..., n, of rank r = (k ,..., k) with a maximal 
pointwise error of ^. A rank k approximation obtained in this way is not optimal in the 
sense that the same accuracy can be reached with a smaller rank. In order to find the near 
best approximation, we make use of the hierarchcial SVD (cf. w- In the first step we 
compute a highly accurate large rank tensor A £ TT(r), in the second step we determine the 
quasi-optimal approximation A £ TT{r), ||A — A\\ < \Jd — 1 inf BeTT(r) 
by truncation of A to rank r via the hierarchical SVD. 


\b -iy, cf. m\4y, 


We give convergence plots for varying target rank and slice density and also carry out 
four detailed, different tests, each one focusing on a different parameter: d (dimension), r 
(final rank), n (size) and C$d (slice density). In these tests we also compare the ADF with 
the ALS algorithm with stopping parameter e stop = 5 x 10 -5 (ADF), and £ stop = 15 x 10 -5 
(ALS). 

Each combination of parameters is tested 20 times for different random P and C, where 
the same random instance of these parameters P and C is used in both ALS and ADF 
tests. Furthermore ( resc) and ( resp) denote the geometric mean of the respective results 
and (time) the arithmetic mean of times. The values in brackets give the geometric vari¬ 
ance, respectively in case of the time the arithmetic variance. A plot of the convergence of 
(■ resp ), (resc) for fixed d = 7 , n = 12 and varying target rank as well as slice density is 
given in Figure [3j We observe convergence for all choices of parameters. In the Tables [lj 
[ 2 j [3] and [4] we list the detailed results of the four mentioned comparisons (left: ALS, right: 
ADF). 

First, we consider the variation of the dimension d £ {5, 6 , 7, 8,13, 21, 34, 55} in Table [l] 
For all dimensions d = 5,..., 55 the approximation seems to be uniformly good and the 
variance with respect to the randomness in the sampling points seems to be quite low. 

Remark 36 (Comparison with HTOpt) As a comparison of our results with the HTOpt 
algorithm from f2f\ I ~22] we perform the first three test of Table^f J i.e. dimension d £ (5, 6 , 7}, 
r — 3, n — 8 , Csd — 10. We have used the default values provided by the program but set the 
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relative residual on C 


relative residual on P 




Figure 3: (d — 7, r = 2,.. ., 8, n — 12 , Csd = 3, 10 , 20 ) Plotted are the residuals ( resp) 
(right) as well as the control residuals (■ resc) (left) as function of the sampling ratio 
p = dnr 2 CsD/n d for varying target ranks r — 2,... ,8 indicated by the respective 
symbols. Each curve corresponds to one choice of the slice density Csd-, for either 
ALS (dashed) or ADF (continous). 


maximal number of iterations to 1000. The following table shows the approximation quality 
on the control set C and given point set P, the accuracy of the near best exponential sum 
approximation (res exp ) from Remark 35. and the number of iterative steps: 


d 

(res c ) 

(resp) 


steps 

5 

6.9e-03 

2.fe-03 

2.3e-03 

519 

6 

2.7e-02 

2.6e-03 

1.5e-03 

750 

7 

5.2e-02 

4-8e-03 

1.0e-03 

579 


We can clearly see that the number of iterations in HTOpt used to find the approximation 
is rather stable. We have used the dense linear algebra version provided in MATLAB, but a 
sparse version is also available. R seems that for smaller dimension the optimization on the 
manifold yields an approximation close to the best one, whereas for larger dimension d the 
quality diminishes. 


In the second experiment we vary the target ranks r G {2,..., 8 } and report the results in 
Table [2j The approximation quality on the reference set P is as we expected (exponentially 
decaying to zero), but on the control set C the fixed slice density Csd limits the accuracy 
that we can achieve by the random sampling. 

In our third experiment we consider the variation of mode sizes n G {6,12,24,48}. The 
results are given in Table [3j We observe a rather slow increase of the error which can be 
attributed to the random sampling. 

In our fourth and last experiment we vary the slice density Csd £ (1,3,10,20,50}. The 
near best approximation, for d — 7, n = 12, r = 3, is obtained as in Remark 35 Its relative 
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d varying, r = 3, n — 8, Csd — 10 


ALS 

ADF 

d 

(res c ) 

(resp) 

(time) 

(res c ) 

(resp) (time) 

5 

2.9e-03(1.6) 9.6e-04(1.2) 0.1(0.0) 

2.9e-03(1.6) 9.6e-04(1.2) 0.1(0.0) 

6 

2.2e-03(1.8) 

4.9e-04(1.3) 

0 .2(0.0) 

2.2e-03(1.8) 

4.9e-04(1.3) 0.1(0.0) 

7 

1.2e-03(1.8) 3.1e-04(1.2) 0.3(0.1) 

1.2e-03(1.8) 3.1e-04(1.2) 0.2(0.1) 

8 

1.2e-03(2.0) 1.7e-04(1.2) 

0.4(0.1) 

1.2e-03(2.0) 

1.7e-04(1.2) 0.3(0.1) 

13 

1.8e-04(1.8) 3.5e-05(l.l) 1.7(0.4) 

1.8e-04(1.8) 3.5e-05(l.l) 1.0(0.2) 

21 

3.7e-05(1.6) 

7.5e-06(1.2) 

7.4(3.2) 

3.7e-05(1.6) 

7.4e-06(1.2) 4.0(1.9) 

34 

7.5e-06(1.6) 1.7e-06(l.l) 24.9(8.3) 

7.4e-06(1.6) 1.7e-06(l.l) 14.0(4.3) 

55 

1.4e-06(1.5) 3.7e-07(l.l) 

71.2(27.7) 

1.4e-06(1.5) 3.7e-07(l.l) 42.5(17.6) 


Table 1: Convergence and timing with respect to the dimension d for otherwise fixed 
parameters. 


residual is res exp = 1.34- ICC 3 . The results in Table |4] show that for Csd —>■ oo, i.e. sampling 
more and more entries of the tensor, the reconstruction gets closer and closer to the best 
rank r — 3 approximation of the tensor. Reasonably good results are already obtained for 
Csd = 3. Note that the relative residual on the sampling set is smaller than the optimal 
residual (due to overfitting). 


The tensor A from (12) is not suitable for a high-dimensional high rank tensor completion 


based on random samples, because the singular behavior is localized in one of the corners of 
the hypercube [0,_R]C In order to better investigate the approximation quality of ALS and 
ADF, we consider the tensor D £ given by the entries 


D 


(ii ,...,i d ) 



For all examples, we choose d = 7, n = 15, Csd = 10, £ s to P = 5 x ICC 4 (ADF) and 
e stop = 15 x 10 -4 (ALS). In our first experiment in Figure [4] we compare the runtime for 
ALS and ADF to reach a target accuracy for a rank rf ina i £ {6,8,10} approximation. In our 
second experiment in Figure [5] we repeat the experiment from Figure [4] and try to exclude any 
effects due to different choices of stopping parameters or initial guesses. For this, we start 
both iterations with the same initial guess of rank rfi na i — 1 obtained from ALS. Instead of 
the total time T we measure the relative time t re i := (T — Tf) /T\ with respect to the runtime 
Ti for rank ry ma / — 1 ALS. We observe that the ADF algorithm is consistently faster than 
ALS. The reason for this is that both iterations require a similar number of steps, but the 
complexity per step of ALS is inferior to that of ADF, cf. Lemma 22 and Lemma 28 These 


observations are highlighted in Figure |6j where we display the average number of iterations 
required until the next rank increase and the average measured time per step for each rank 
(for d = 7,7i = 15, rf ina i = 14, Csd — 10) of both ALS and ADF. The detailed timing results 
of the experiments are given in Table [5j averaged over 20 trials for each r £ (4,6,..., 14}. 
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d = 7, r varying, n = 12, Csd = 10 


ALS 

ADF 

r 

(■ resc) {resp) {time) 

{res c ) 

{resp) {time) 

2 

8.1e-03(1.3) 4.1e-03(1.2) O.l(O.O) 

8.1e-03(1.3) 

4.1e-03(1.2) 0.1(0.0) 

3 

1.9e-03(1.6) 3.8e-04(l.l) 0.6(0.1) 

1.9e-03(1.6) 

3.8e-04(l.l) 0.4(0.1) 

4 

5.8e-04(2.4) 3.6e-05(1.2) 9.0(2.7) 

5.7e-04(2.4) 

3.6e-05(1.2) 4.9(1.1) 

5 

4.6e-04(2.8) 4.3e-06(1.2) 80.4(27.3) 

5.0e-04(2.6) 

4.1e-06(1.2) 50.1(17.6) 

6 

3.0e-04(2.3) 7.9e-07(1.4) 260.6(72.2) 

2.9e-04(2.4) 

8.7e-07(1.2) 131.3(29.9) 

7 

1.6e-04(2.4) 1.5e-07(1.3) 761.9(124.4) 

1.5e-04(2.5) 

2.1e-07(1.2) 284.1(44.9) 

8 

1.8e-04(2.5) 3.4e-08(1.4) 1964.9(309.3) 

2.1e-04(2.4) 

8.1e-08(1.2) 555.2(68.2) 


Table 2: Convergence and timing with respect to the target rank r = rf ina i for otherwise 
fixed parameters. 


d = 7, r = 3 , n varying, Csd = 10 


ALS 

ADF 

n 

(■ res c ) 

(■ resp) 

{time) 

(■ res c ) 

{resp) 

{time) 

6 

9.2e-04(2.1) 

2.5e-04(1.2) 

0.2(0.1) 

9.2e-04(2.1) 

2.5e-04(1.2) 

0.1(0.1) 

12 

1.9e-03(1.6) 

3.8e-04(l.l) 

0.6(0.1) 

1.9e-03(1.6) 

3.8e-04(l.l) 

0.4(0.1) 

24 

3.4e-03(1.5) 

4.4e-04(l.l) 

1.5(0.4) 

3.4e-03(1.5) 

4.4e-04(l.l) 

1.0(0.3) 

48 

3.9e-03(1.5) 

5.5e-04(l.l) 

4.5(1.3) 

3.9e-03(1.5) 

5.5e-04(l.l) 

3.3(1.0) 


Table 3: Convergence and timing with respect to the mode sizes n for otherwise fixed 
parameters. 


4.3 Reconstruction of a Low Rank Tensor without Noise 

As second group of examples, we consider quasi-random tensors with exact, common low 
TT ranks A e TT(r,... ,r) (cf. Definition [2]) . Each quasi-random tensor is generated via 
a TT representation A = A G where we assign to each entry of each block G\ ,..., Gd a 
uniformly distributed random value in [—0.5, 0.5]. Each combination of parameters is tested 
20 times for different random P and C and stopping parameter £ st0 p := 5 x 10 -4 (ADF) 
and £ s top ■— 15 x 10 ~ 4 (ALS). We consider such a reconstruction successful if resc < 10~ 6 . 
First, we do not change the quasi-random tensor. In the test afterwards, we manipulate the 
singular values of the original quasi-random tensor. 

4.3.1 Quasi-random Tensors 

In the first test we consider the reconstruction of quasi-random tensors as described above. 
Since the rank is exactly r = 1,..., 8 it would in principle be possible to find a tensor of 
exactly rank r that interpolates the sampled points. However, due to the nature of the 
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d = 7 , r = 3, n — 12, Csd varying 


ALS 

ADF 

c 

(res c ) 

(resp) 

{time) 

(res c ) 

(resp) 

{time) 

1 

4.7e-03(1.8) 

4.2e-05(1.3) 

1.8(0.4) 

4.2e-03(1.9) 

3.4e-05(1.3) 

1.6(0.3) 

3 

2.8e-03(1.7) 

1.7e-04(1.2) 

0.7(0.2) 

2.8e-03(1.7) 

1.7e-04(1.2) 

0.4(0.1) 

10 

1.9e-03(1.6) 

3.8e-04(l.l) 

0.6(0.1) 

1.9e-03(1.6) 

3.8e-04(l.l) 

0.4(0.1) 

20 

2.0e-03(1.7) 

5.7e-04(1.2) 

0.8(0.2) 

2.0e-03(1.7) 

5.7e-04(1.2) 

0.5(0.1) 

50 

1.4e-03(1.5) 

7.2e-04(l.l) 

1.1(0.1) 

1.4e-03(1.5) 

7.2e-04(l.l) 

0.8(0.1) 


Table 4: Convergence and timing with respect to the slice density Csd for otherwise fixed 
parameters. 


d = 7, r varying, n = 15, Csd = 10 


ALS 

ADF 

r 

(■ resc) {resp) {time) 

{res c ) {'resp) 

{time) 

4 

3.5e-03(1.0) 3.1e-03(1.0) 0.6(0.2) 

3.5e-03(1.0) 3.1e-03(1.0) 

0.3(0.1) 

6 

9.6e-04(1.0) 8.1e-04(1.0) 6.8(2.4) 

9.6e-04(1.0) 8.1e-04(1.0) 

1.4(0.0) 

8 

1.9e-04(1.0) 1.4e-04(1.0) 64.7(2.5) 

1.9e-04(l.l) 1.4e-04(1.0) 

5.2(0.2) 

10 

6.3e-05(1.0) 4.9e-05(1.0) 133.4(5.0) 

6.3e-05(1.0) 4.9e-05(1.0) 

15.2(0.5) 

12 

2.4e-05(l.l) 1.5e-05(1.0) 466.7(15.8) 

2.4e-05(l.l) 1.5e-05(1.0) 

34.9(0.8) 

14 

7.8e-06(l.l) 4.6e-06(1.0) 1700.0(112.5) 

7.9e-06(l.l) 4.6e-06(1.0) 

94.6(5.0) 


Table 5: Convergence and timing with respect to the target rank r = Vf ina i for otherwise 
fixed parameters. 


random sampling and possible local minima we do not always reconstruct the tensor. The 
number of successful reconstructions for 20 random tensors is displayed in 20 shades of gray, 
from white (0) to black (all 20). In Figure [7] for d = 4 and d = 5 we observe that both ALS 
and ADF are able to reconstruct the tensor (with known target rank r) provided that the 
slice density is high enough. For larger ranks r it seems that a slice density of Csd = 4 is 
enough, but for smaller ranks the slice density has to be larger in order to compensate for 
the randomness in both the tensor as well as the sampling set P. 

4.3.2 Quasi-random Tensors with Decaying Singular Values 

We base the second group of tests for random tensors on the same quasi-random tensors as 
above. However, for each tensor and each matricization we enforce the singular values to 
decay exponentially (that is Uj = 10~*) by rescaling them. We therefore alternatingly adapt 
the singular values of the according matricizations of the random tensor. Note that this can 
be done indirectly via the given representation of the random tensor. The difference to the 
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Figure 4: (d = 7, r = 6, 8,10, n = 15, Csd = 10) Plotted are, for varying target ranks 
rfinal = 6,8,10, the residual resp (dashed) as well as the control residual resc 
(continous) as functions of the total time (in seconds) for one trial, for ALS (black, 
upper curves) and ADF (blue, lower curves). 


previous group of tests is that now the smaller singular values are dominated by the large 
ones. The results in Figure [8] show a similar behaviour as before with the exception that, 
with respect to reconstruction capability, ADF performs slightly worse in d = 4 and worse 
in dimension d = 5. 


4.3.3 Quasi-random Tensors with Decaying Singular Values and Gap 


The third group of tests is again based on the quasi-random tensors of Subsection 4.3.2 This 


time, the singular values of each matricization of each tensor are rescaled to = 10 _l for 
i < r/2 and = 10 - * -2 for i > r / 2 , i.e. , there is a gap in the singular values after the first 
r/2 singular values. We illustrate the results by two diagrams in Figure |9j in which we plot 
the residuals resp and resc on the y-axis against the elapsed time on the x-axis. We fix the 
dimension d = 5, r = 8, the mode size n = 12, and the slice density Csd = 10. The dashed 
vertical and horizontal lines mark the points at which the rank is increased and are labelled 
on the x-axis with the corresponding (higher) rank. 

We observe that the gap in the singular values is clearly apparent in the approximation 
quality of the reconstruction, both in the given sample set P as well as the control set C. 
Each of the residuals drops by three orders of magnitude if the rank approaches r/2. Also, 
we can see that the residuals in P and C are almost the same, which is most likely a special 
property of random tensors. The comparison shows a clear advantage of the ADF iteration 
over the ALS iteration with respect to timing. 


4.4 Reconstruction of a Low Rank Tensor with Noise 


In the fourth group of tests we repeat the ones from Subsection T3 but with perturbed tensors 
A = A + 10 -4 z/£, where A is generated as before and v := \\A\\p/y/#P. The perturbation 
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relative time, r final = 6 relative time, r final = 8 relative time, r final = 10 

Figure 5: (d = 7, r = 6, 8,10, n = 15, Csd = 10) Plotted are, for varying target ranks 
Rfinal = 6,8,10, the residual resp (dashed) as well as the control residual resc 
(continous) as functions of the relative time t re i for one trial, for ALS (black, upper 
curves) and ADF (blue, lower curves). Both methods start with the same initial 
guess of rank rf ina i — 1 obtained by ALS. 


£ is a tensor of the same proportions as A and without any prescribed rank structure. Each 
of its entries is assigned a uniform random value in [—1,1], A test is considered successful 
if resc < 10~ 3 , where the control set residual is evaluated for A and not A. However, no 
information about the non perturbed tensor is used in the algorithm. The results are identical 
to those of Subsection |4.3[ i.e. the perturbation has no influence on the reconstruction as long 


as the magnitude is below the target accuracy. We do not yet have a theoretical justification 
for this very pronounced effect and believe that a thorough analysis might reveal more insight. 


4.5 Stochastic Elliptic PDE with Karhunen-Loeve Expansion 

Our last numerical example is a tensor completion problem based on an elliptic PDE with 
stochastic coefficient a, 


-div(a(x,y)Vu(x,y)) = f(x), (x,y) e D x 0, 

u(x,y ) = 0 (x,y) G dD x O, 

where y E O is a random variable and D = [—1,1]. The goal is to determine the expected 
value of the average of the solutions u(y) := f D u(x, y) dx. We follow the procedure described 
in [ISjUnj where first the stochastic coefficient is replaced by a truncated d+1-term Karhunen- 
Loeve (KL) expansion. Subsequently the solution space over the computational domain D 
is discretised by finite elements and the d stochastic independent variables are sampled on 
a uniform grid, which yields averaged solutions := u(ii ,..., id) depending on the 

parameters i^. For each parameter combination (ii,... ,id), a deterministic problem has to 
be solved and the average over all solutions gives the sought expected value. In this example, 
we choose f(x) = 1 and use a finite element space with m = 50 degrees of freedom. 
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Figure 6: (d — 7, n — 15, rf ina i = 14, Csd = 10) Plotted are the average number of iterations 
required until the next rank increase (left) and the average measured time per step 
for each rank (right) for ALS (black) and ADF (blue). 


In Figure [To] we display the convergence for algebraically decaying KL eigenvalues \f\, = 
(1 + p)~ 2 , final rank rf ina i G {4,6,8} for dimension d — 5, slice density Csd — 6 and 
stopping parameter £ stop := 5 x 10~ 4 (ADF) and £ stop 15 x 10 -4 (ALS). We observe 
that both methods eventually find a completed tensor of comparable approximation quality, 
both in terms of the residual on P and on C. The ADF iteration is consistently faster, 
and with increasing rank ry ma / one can clearly see the advantage of the asymptotically lower 
complexity per step. In the tests in Figure 11 we try to exclude any effects due to different 
choices of stopping parameters or initial guesses. For this, we start both iterations with 
the same initial guess of rank rf ino j — 1 obtained from ALS. Instead of the total time T we 
measure the relative time t re i := {T — T\) /T\ with respect to the runtime T\ for rank rf ma i — 1 
ALS. Again, we observe that ADF is consistently faster. 


4.6 C Implementation 

The C implementation of the ALS and ADF algorithm, which was used for the latter results, 
can be found at 


http://www.igpm.rwth-aachen.de/personen/kraemer 


5 Conclusions 

In this article, we presented two variants of an alternating least squares algorithm that aim at 
finding a low tensor rank approximation to a tensor whose entries are known only in a small 
subset of all indices. It is important to use a certain oversampling factor, respectively slice 
density Csd, hr order to obtain a reasonable reconstruction of the tensor. In our numerical 
experiments it turns out that this factor depends on the dimension but can be decreased 
with increasing rank. We obtain successful results already for the almost minimal value 
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Figure 7: (d = 4,5 , r varying, n = 12, Csd varying, constant singular values) Displayed as 
shades of gray (white (0) to black (all 20)) are the number of successful reconstruc¬ 
tions for varying target ranks r = 1,..., 8 and slice densities Csd — 2,4,..., 256 
for ALS and ADF. 
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Figure 8: (d = 4,5, r varying, n = 12, Csd varying, decaying singular values) Displayed as 
shades of gray (white (0) to black (all 20)) are the number of successful reconstruc¬ 
tions for varying target ranks r = 1,..., 8 and slice densities Csd = 2,4,..., 256 
for ALS and ADF. 


Csd = 2. Both, the SOR-type solver ADF as well as the simple (and well known) alternating 
least squares method ALS are able to find reconstructions or approximations for moderate 
rank r = 1,..., 14 and dimension d = 3,..., 55. From our experiments we recommend to 
use the faster ADF algorithm, because the advantage of the 0{r 2 d^P) scaling over the 
0(r 4 d#P) scaling of ALS is already visible for rank r = 3. A modification or extension is 
necessary in order to treat varying TT ranks r\,..., r ( i-\ instead of a uniform rank. Also, 
large mode sizes n > 100 possibly require smoothness conditions and a refined sampling 
strategy. The influence of noise on the reconstruction is rather harmless, where the noise 
can be unstructured or of rank structure but of smaller magnitude than the desired target 
accuracy. It seems that the low rank format introduces an automatic regularization in the 
same way as the singular value truncation Liters high frequency components. 
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relative residual on C and P 


relative residual on C and P 




Figure 9: (d — 5, r = 8, n — 12, Csd = 10) Plotted are the relative residuals for one trial 
of a reconstruction of a tensor with a gap in its exponentially decaying singular 
values for ALS (left) and ADF (right). Additionally, circles and accordant, dashed 
lines as well as the x-axis label indicate when the algorithm automatically increases 
ranks. 
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